35 research outputs found

    Making Asynchronous Stochastic Gradient Descent Work for Transformers

    Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality compared to synchronous SGD. To investigate why this is the case, we isolate differences between the asynchronous and synchronous methods and examine batch size and staleness effects. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this hybrid method, Transformer training for a neural machine translation task reaches a near-convergence level 1.36x faster in single-node multi-GPU training with no impact on model quality.
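    The core change is small: instead of applying each worker's stale gradient as soon as it arrives, the optimizer accumulates a fixed number of asynchronous updates and applies their sum. A minimal sketch follows; the accumulation size, the toy parameter vector, and the plain SGD step are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def accumulate_then_apply(incoming_gradients, params, lr=0.1, accumulate=4):
    """Sum every `accumulate` asynchronous gradients before one optimizer step.

    `incoming_gradients` is an iterable of gradients in arrival order, i.e.
    whatever order the asynchronous workers happen to deliver them in.
    """
    buffer = np.zeros_like(params)
    count = 0
    for grad in incoming_gradients:
        buffer += grad            # sum stale updates instead of applying each one
        count += 1
        if count == accumulate:   # one combined step, like a larger synchronous batch
            params -= lr * buffer
            buffer[:] = 0.0
            count = 0
    if count:                     # flush any leftover gradients
        params -= lr * buffer
    return params

# Toy usage: eight asynchronous workers each contribute a noisy gradient estimate.
params = np.zeros(3)
grads = [np.random.randn(3) * 0.01 for _ in range(8)]
params = accumulate_then_apply(grads, params)
```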

    Approximating neural machine translation for efficiency

    Neural machine translation (NMT) has been shown to outperform statistical machine translation. However, NMT models typically require a large number of parameters and are expensive to train and deploy. Moreover, their large size makes parallel training inefficient due to costly network communication. Likewise, distributing the model and running it locally on a client such as a web browser or mobile device remains challenging. This thesis investigates ways to approximately train an NMT system by compressing either the gradients or the parameters for faster communication or reduced memory consumption. We propose a gradient compression technique that exchanges only the top 1% of the most significant gradient values while delaying the rest to be considered in the next iteration. This method reduces the network communication cost by 50-fold but causes noisy gradient updates. We also find that the Transformer, the current state-of-the-art NMT architecture, is highly sensitive to noisy gradients. Therefore, we extend the compression technique by restoring the compressed gradient with locally computed gradients. We obtain a linear scale-up in parallel training without sacrificing model performance. We also explore transfer learning as a better way to initialise training. With transfer learning, the model converges faster and can be trained with more aggressive hyperparameters. Lastly, we propose a log-based quantisation method to compress the model size. Models are quantised to 4-bit precision with no noticeable quality degradation after re-training, with the quantisation errors retained as feedback.
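    The top-1% exchange with delayed residuals is essentially top-k sparsification with error feedback: each worker sends only its largest-magnitude gradient entries and folds the unsent remainder into the next step's gradient. A minimal sketch under that reading; the 1% ratio matches the abstract, while the function name and stateless call signature are illustrative assumptions.

```python
import numpy as np

def topk_compress(grad, residual, ratio=0.01):
    """Keep the largest `ratio` fraction of gradient entries; defer the rest.

    `residual` carries the entries that were not sent last step, so no
    gradient information is dropped permanently (error feedback).
    """
    accumulated = grad + residual                     # fold in delayed values
    k = max(1, int(ratio * accumulated.size))
    flat = np.abs(accumulated).ravel()
    threshold = np.partition(flat, -k)[-k]            # k-th largest magnitude
    mask = np.abs(accumulated) >= threshold
    sent = np.where(mask, accumulated, 0.0)           # sparse tensor to exchange
    new_residual = accumulated - sent                 # kept locally for next step
    return sent, new_residual

grad = np.random.randn(1000)
residual = np.zeros_like(grad)
sent, residual = topk_compress(grad, residual)
print(np.count_nonzero(sent))                         # roughly 1% of entries
```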

    Compressing Neural Machine Translation Models with 4-bit Precision

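    No abstract is shown for this entry, but the thesis abstract above describes the technique: weights are quantised to 4-bit, power-of-two levels, with the quantisation error fed back during re-training. A minimal sketch of log-based 4-bit quantisation under those assumptions; the per-tensor max scale and the clipping range are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Quantise a weight tensor to signed power-of-two levels.

    A minimal sketch of log-based quantisation: one sign bit plus
    (bits - 1) bits of exponent relative to the tensor's max magnitude.
    """
    s = np.abs(w).max()                      # per-tensor scale (assumed choice)
    n_levels = 2 ** (bits - 1)               # exponent codes available
    # Exponent of each weight relative to the scale, rounded to an integer.
    exp = np.round(np.log2(np.maximum(np.abs(w), 1e-12) / s))
    exp = np.clip(exp, -(n_levels - 1), 0)   # keep codes within the 4-bit budget
    return np.sign(w) * s * (2.0 ** exp)     # dequantised approximation

w = np.random.randn(4, 4).astype(np.float32) * 0.1
w_q = log_quantize(w)
print(np.abs(w - w_q).max())                 # the quantisation error kept as feedback
```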

    Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering

    We introduce Mintaka, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. Mintaka is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish, for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers. We run baselines over Mintaka, the best of which achieves 38% hits@1 in English and 31% hits@1 multilingually, showing that existing models have room for improvement. We release Mintaka at https://github.com/amazon-research/mintaka. (Accepted at COLING 2022.)
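    The hits@1 figures reported above are the fraction of questions whose top-ranked prediction matches a gold answer. A minimal sketch of that metric; the dict-of-ranked-strings input format is an assumption for illustration, not Mintaka's actual schema.

```python
def hits_at_1(predictions, gold_answers):
    """Fraction of questions where the top prediction is a gold answer.

    `predictions` maps a question id to a ranked list of answer strings;
    `gold_answers` maps the same ids to the set of acceptable answers.
    """
    correct = sum(
        1 for qid, ranked in predictions.items()
        if ranked and ranked[0] in gold_answers[qid]
    )
    return correct / len(predictions)

preds = {"q1": ["Jupiter", "Saturn"], "q2": ["1969"]}
gold = {"q1": {"Jupiter"}, "q2": {"1968"}}
print(hits_at_1(preds, gold))  # 0.5
```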

    LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

    This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, on the synthesised data. We compare the performance of training with data generated in English and in target languages, as well as with translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g., a notable 13.4-point accuracy improvement in the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages; however, they struggle to generate meaningful text in certain languages like Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency. (Accepted at EMNLP 2023 Main Conference.)
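    The augmentation step amounts to prompting an LLM for new examples in the format of the target dataset and collecting its completions as synthetic training data. A minimal sketch for XCOPA-style (premise, two choices, cause/effect label) items; the prompt wording and the `call_llm` hook are hypothetical stand-ins for whichever LLM endpoint is used, not the paper's actual prompts.

```python
import json

def build_copa_prompt(n_examples=5, language="Swahili"):
    """Ask an LLM for new COPA-style items as a JSON list.

    The paper's exact instructions are not reproduced here; this prompt
    is only an illustrative template.
    """
    return (
        f"Generate {n_examples} new commonsense reasoning examples in {language}. "
        "Each example must be a JSON object with keys: premise, choice1, choice2, "
        "question ('cause' or 'effect'), and label (0 or 1 for the correct choice). "
        "Return only a JSON list."
    )

def augment(call_llm, n_examples=5, language="Swahili"):
    """`call_llm` is a hypothetical helper: prompt string -> completion string."""
    completion = call_llm(build_copa_prompt(n_examples, language))
    examples = json.loads(completion)            # parse the synthetic items
    # Keep only well-formed items before adding them to the training set.
    required = {"premise", "choice1", "choice2", "question", "label"}
    return [ex for ex in examples if required <= ex.keys()]
```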

    In Neural Machine Translation, What Does Transfer Learning Transfer?

    Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding that even randomly generated sequences used as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training Transformer models on high-resource language pairs.
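    The embedding-transfer part of these ablations boils down to copying parent embeddings into the child model for tokens the two vocabularies share, and initialising the rest fresh. A minimal sketch under that reading; the dictionary-based vocabularies and the random initialisation for unmatched tokens are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def transfer_embeddings(parent_emb, parent_vocab, child_vocab, dim, seed=0):
    """Build a child embedding matrix, aligned to the parent where possible.

    `parent_emb` is (|parent_vocab|, dim); `*_vocab` map token -> row index.
    Shared tokens reuse the parent's vectors; unseen tokens get random vectors.
    """
    rng = np.random.default_rng(seed)
    child_emb = rng.normal(scale=0.01, size=(len(child_vocab), dim))
    copied = 0
    for token, child_idx in child_vocab.items():
        parent_idx = parent_vocab.get(token)
        if parent_idx is not None:
            child_emb[child_idx] = parent_emb[parent_idx]   # transfer aligned vector
            copied += 1
    print(f"transferred {copied}/{len(child_vocab)} embeddings")
    return child_emb

parent_vocab = {"the": 0, "cat": 1, "</s>": 2}
child_vocab = {"the": 0, "dog": 1, "</s>": 2}
parent_emb = np.random.randn(3, 4)
child_emb = transfer_embeddings(parent_emb, parent_vocab, child_vocab, dim=4)
```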

    Bactrian-X: A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation

    Instruction tuning has shown great promise in the field of natural language processing. However, research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets. To address this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components seamlessly integrated into the foundation model. These adapters have a significantly smaller parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Through extensive experiments on 52 languages, we demonstrate the superior performance of our models in various multilingual evaluation settings. Our proposed models outperform both the vanilla models and existing instruction-tuned models. The code and models are publicly available at https://github.com/mbzuai-nlp/bactrian-x.
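    Training language-specific adapters with LoRA is typically done by wrapping a base model with low-rank adapter weights so that only those small matrices are updated while the base weights stay frozen. A minimal sketch using the Hugging Face peft library; the base model name, rank, and target modules are illustrative placeholders rather than Bactrian-X's actual configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "bigscience/bloom-560m"           # small stand-in base model for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters are injected into the attention projections; only these
# small matrices are trained, the base weights stay frozen.
config = LoraConfig(
    r=16,                                # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # a small fraction of the base model's parameters

# The wrapped model is then fine-tuned on instruction-response pairs with any
# standard causal-LM training loop (e.g. transformers.Trainer).
```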